Abstract


Stacked denoising autoencoders (DAEs) are well known to learn useful deep representations,
which can be used to improve supervised training by initializing a deep network. We investigate
a training scheme for a deep DAE in which DAE layers are added gradually and keep adapting as
additional layers are added. We show that in the regime of mid-sized datasets, this gradual
training provides a small but consistent improvement over stacked training in both
reconstruction quality and classification error on the MNIST and CIFAR datasets.


1 Gradual training of denoising autoencoders 


We test here gradual training of deep denoising autoencoders: the network is trained layer by layer,
but lower layers keep adapting throughout training. To allow lower layers to adapt continuously,
noise is injected at the input level. This training procedure differs from stacked training of
autoencoders (Vincent et al., 2010).

More specifically, in gradual training, the first layer of the deep DAE is trained as in stacked training,
producing a layer of weights $w_1$. Then, when adding the second-layer autoencoder, its weights $w_2$
are tuned jointly with the already-trained weights $w_1$. Given a training sample $x$, we generate a
noisy version $\tilde{x}$, feed it to the 2-layered DAE, and compute the activations at the subsequent
layers $h_1 = \mathrm{Sigmoid}(w_1^\top \tilde{x})$, $h_2 = \mathrm{Sigmoid}(w_2^\top h_1)$ and
$y = \mathrm{Sigmoid}({w'}^\top h_2)$. Importantly, the loss function is computed over the input $x$,
and is used to update all the weights including $w_1$. Similarly, if a 3rd layer is trained, it
involves tuning $w_1$ and $w_2$ in addition to $w_3$ and the output weights.
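
As an illustration of the joint update described above, the following is a minimal pure-Python sketch of one gradual-training SGD step for a 2-layered DAE. It is not the authors' implementation: biases are omitted (as in the formulas above), masking noise and a per-sample cross-entropy loss are assumed, and the names `gradual_step`, `w_out` (standing in for the output weights), `demo` and all sizes and learning rates are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(w, v):
    # Sigmoid(w^T v), where w[i][j] connects input i to unit j.
    return [sigmoid(sum(w[i][j] * v[i] for i in range(len(v))))
            for j in range(len(w[0]))]


def forward(x_in, w1, w2, w_out):
    h1 = layer(w1, x_in)
    h2 = layer(w2, h1)
    y = layer(w_out, h2)
    return h1, h2, y


def cross_entropy(x, y, eps=1e-12):
    return -sum(xi * math.log(yi + eps) + (1.0 - xi) * math.log(1.0 - yi + eps)
                for xi, yi in zip(x, y))


def backprop_delta(delta_next, w, h):
    # Error signal at sigmoid layer h, given the error at the next layer.
    return [sum(w[i][j] * delta_next[j] for j in range(len(delta_next)))
            * h[i] * (1.0 - h[i]) for i in range(len(h))]


def sgd_update(w, v_in, delta, lr):
    for i in range(len(v_in)):
        for j in range(len(delta)):
            w[i][j] -= lr * v_in[i] * delta[j]


def gradual_step(x, w1, w2, w_out, lr, rng, noise=0.15):
    # Noise is injected at the input level (masking noise is an assumption).
    x_tilde = [0.0 if rng.random() < noise else xi for xi in x]
    h1, h2, y = forward(x_tilde, w1, w2, w_out)
    # The loss is computed against the clean input x.
    d_out = [yi - xi for yi, xi in zip(y, x)]  # sigmoid + cross-entropy gradient
    d2 = backprop_delta(d_out, w_out, h2)
    d1 = backprop_delta(d2, w2, h1)
    # All weights are updated, including the already-trained w1.
    sgd_update(w_out, h2, d_out, lr)
    sgd_update(w2, h1, d2, lr)
    sgd_update(w1, x_tilde, d1, lr)
    return cross_entropy(x, y)


def init_weights(n_in, n_out, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


def mean_clean_loss(data, w1, w2, w_out):
    return sum(cross_entropy(x, forward(x, w1, w2, w_out)[2]) for x in data) / len(data)


def demo(seed=0, epochs=50, lr=0.1):
    # Toy binary data; in the gradual scheme w1 would come from first-layer training.
    rng = random.Random(seed)
    w1, w2, w_out = init_weights(6, 4, rng), init_weights(4, 4, rng), init_weights(4, 6, rng)
    probs = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
    data = [[1.0 if rng.random() < p else 0.0 for p in probs] for _ in range(30)]
    before = mean_clean_loss(data, w1, w2, w_out)
    w1_start = [row[:] for row in w1]
    for _ in range(epochs):
        for x in data:
            gradual_step(x, w1, w2, w_out, lr, rng)
    after = mean_clean_loss(data, w1, w2, w_out)
    return before, after, w1 != w1_start
```

Calling `demo()` trains on a small synthetic dataset and returns the reconstruction loss before and after training, together with a flag confirming that the first-layer weights moved during second-layer training.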


2 Experimental procedures 


We compare the performance of gradual and stacked training in two learning setups: an unsupervised
denoising task, and a supervised classification task initialized using the weights learned in an
unsupervised way. Evaluations were made on three benchmarks (MNIST, CIFAR-10 and CIFAR-100),
but we show only MNIST results here due to space constraints. We used a test subset of 10,000
samples and several training-set sizes, all maintaining a uniform distribution over classes.
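
One plausible way to draw training subsets of different sizes while keeping the classes equally represented is sketched below. The paper does not describe its sampling code, so `balanced_subset` and `class_counts` are hypothetical helpers, and exact per-class equality is an assumption about what "uniform distribution over classes" means.

```python
import random
from collections import defaultdict


def balanced_subset(labels, subset_size, rng):
    """Return shuffled indices of a subset with the same number of samples per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    per_class, remainder = divmod(subset_size, len(by_class))
    if remainder:
        raise ValueError("subset_size must be divisible by the number of classes")
    chosen = []
    for label in sorted(by_class):
        if len(by_class[label]) < per_class:
            raise ValueError("not enough samples in class %r" % (label,))
        chosen.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(chosen)
    return chosen


def class_counts(labels, indices):
    """Count how many selected indices fall in each class."""
    counts = defaultdict(int)
    for idx in indices:
        counts[labels[idx]] += 1
    return dict(counts)
```

For example, `balanced_subset(labels, 200, random.Random(0))` on a 10-class label list returns 200 distinct indices, 20 per class.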

Hyperparameters were selected using a second level of cross-validation, including the learning rate,
SGD batch size, momentum and weight decay. In the supervised experiments, training was ’early
stopped’ after 35 epochs without improvement. The results reported below are averages over 3
train-validation splits. Since gradual training involves updating lower layers, every presentation of a
sample involves more weight updates than in a single-layered DAE. To compare stacked and gradual
training on common ground, we limited gradual training to use the same budget of weight-update
steps as stacked training.

For example, when training the second layer for n epochs in gradual training, we allocate 2n training
epochs for stacked training (details in the full paper).
Figure 1: Unsupervised and supervised training results on the MNIST dataset. Error bars are over 3
train-validation splits. The network has 2 hidden layers with 1000 units each. (a) Reconstruction
error of unsupervised training methods, measured by cross-entropy loss. The shown cross-entropy
error is relative to the minimum possible error, computed as the cross-entropy error of the original
uncorrupted test set with itself. All compared methods used the same budget of update operations.
Images were corrupted with 15% masking noise. The 1st hidden layer is trained for 50 epochs. The
total epoch budget for the 2nd hidden layer is 80 epochs. (b) Classification error of supervised
training initialized based on DAEs. Each curve shows a different pre-training type. Text labels show
the percentage of error improvement of Stacked-vs-Gradual 0 pretraining compared to
Stacked-vs-Gradual 1 pretraining.
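
The baseline mentioned in the caption can be illustrated with a short sketch. For pixel intensities strictly between 0 and 1, the cross-entropy of an image with itself is positive, so even a perfect reconstruction has a nonzero floor. The code below assumes that "relative to" means subtracting this floor (a ratio is another possible reading); `relative_cross_entropy` is a hypothetical name.

```python
import math


def cross_entropy(target, output, eps=1e-12):
    """Per-image cross-entropy between target pixels and output pixels in [0, 1]."""
    return -sum(t * math.log(o + eps) + (1.0 - t) * math.log(1.0 - o + eps)
                for t, o in zip(target, output))


def relative_cross_entropy(clean, reconstruction):
    """Cross-entropy of the reconstruction minus the floor CE(clean, clean)."""
    floor = cross_entropy(clean, clean)
    return cross_entropy(clean, reconstruction) - floor
```

Under this reading, a perfect reconstruction scores zero and any other reconstruction scores above zero, which makes curves comparable across test sets with different intensity statistics.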


3 Results 

We evaluate gradual and stacked training in an unsupervised image-denoising task, and then evaluate
the quality of the two methods for initializing a network in a supervised learning task.

Unsupervised learning for denoising. We first evaluate gradual training in an unsupervised task 
of image denoising. Here, the network is trained to minimize a cross-entropy loss over corrupted 
images. In addition to stacked and gradual training, we also tested a hybrid method that spends 
some epochs on tuning only the second layer (as in stacked training), and then spends the rest of the 
training budget on both layers (as in gradual training). We define the Stacked-vs-Gradual fraction
$0 \le f \le 1$ as the fraction of weight updates that occur during stacked-type training: $f = 1$ is
equivalent to pure stacked training, while $f = 0$ is equivalent to pure gradual training. Given a
budget of $n$ training epochs, we train the 2nd hidden layer with gradual training for $n(1 - f)$ epochs,
and with stacked training for $2nf$ epochs. More specifically, since stacked training tunes a single
layer of weights and gradual training tunes two layers of weights, we selected the number of stacked
epochs $s$ and the number of gradual epochs $g$ such that $s + 2g = n$, and varied $s$ and $g$ to
obtain different ratios $f = 1 - \frac{2g}{s + 2g}$.
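
The bookkeeping behind the hybrid schedule can be sketched as follows, assuming the budget n in s + 2g = n is counted in single-layer (stacked) epochs, so that one gradual epoch costs two units. If n is instead counted in gradual epochs, both epoch counts double, which gives the n(1 - f) and 2nf figures quoted above. `split_budget` and `stacked_fraction` are hypothetical helper names.

```python
def split_budget(n, f):
    """Split a budget of n single-layer epochs into (stacked, gradual) epochs.

    f is the Stacked-vs-Gradual fraction: the share of weight updates spent
    in stacked-type training. The split satisfies s + 2 * g <= n, with
    equality whenever n - s is even.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    s = round(f * n)      # stacked epochs cost one unit each
    g = (n - s) // 2      # gradual epochs cost two units each
    return s, g


def stacked_fraction(s, g):
    """Fraction of weight updates that occur during stacked-type training."""
    return s / (s + 2 * g)
```

With a budget of 80 units, `split_budget(80, 0.0)` gives 40 gradual epochs and no stacked epochs, while `split_budget(80, 0.5)` gives 40 stacked and 20 gradual epochs.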

Figure 1(a) shows the test-set cross-entropy error when training 2-layered DAEs, as a function of the
Stacked-vs-Gradual fraction. Pure gradual training achieved significantly lower reconstruction error
than any mix of stacked and gradual training with the same budget of update steps.

Gradual-training DAE for initializing a network in a supervised task. We further tested the 
benefits of using the DAEs trained in the previous experiment for initializing a deep network in a 
supervised classification task. We initialized the first two layers of the deep network with the weights 
of the SDAE and added a classification layer on top with output units matching the classes in the 
dataset, with randomly initialized weights. 
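
A minimal sketch of this initialization is given below. It copies two pretrained weight matrices into a classifier and adds a randomly initialized output layer. The softmax output, the Gaussian initialization scale, the absence of biases and the names `init_classifier` and `predict_proba` are all assumptions made for illustration; the paper does not specify them.

```python
import copy
import math
import random


def init_classifier(w1, w2, n_classes, rng, scale=0.01):
    """Build classifier parameters from pretrained DAE weights w1 and w2."""
    n_hidden = len(w2[0])
    # Only the classification layer starts from random weights.
    w_cls = [[rng.gauss(0.0, scale) for _ in range(n_classes)]
             for _ in range(n_hidden)]
    return {"w1": copy.deepcopy(w1), "w2": copy.deepcopy(w2), "w_cls": w_cls}


def sigmoid_layer(w, v):
    return [1.0 / (1.0 + math.exp(-sum(w[i][j] * v[i] for i in range(len(v)))))
            for j in range(len(w[0]))]


def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(params, x):
    """Class probabilities for one input vector x."""
    h1 = sigmoid_layer(params["w1"], x)
    h2 = sigmoid_layer(params["w2"], h1)
    w_cls = params["w_cls"]
    logits = [sum(w_cls[i][j] * h2[i] for i in range(len(h2)))
              for j in range(len(w_cls[0]))]
    return softmax(logits)
```

Supervised training would then fine-tune all three weight matrices; the deep copies keep the pretrained DAE weights untouched so that the same pretraining run can seed several supervised experiments.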

To quantify the benefit of gradual unsupervised pretraining, we trained these networks on subsets of
the training set. Figure 1(b) traces the classification error as a function of training-set size,
demonstrating a consistent but small improvement when using gradual training over stacked training
(text legends). This effect is mostly relevant for datasets with fewer than 50K samples. Similar
results were obtained using CIFAR-10 and CIFAR-100.